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Abstract 

High-throughput metabarcoding studies on fungi and other eukaryotic micro- 
organisms are rapidly becoming more frequent and more complex, requiring 
researchers to handle ever increasing amounts of raw sequence data. Here, we 
provide a flexible pipeline for pruning and analyzing fungal barcode (ITS 
rDNA) data generated as paired-end reads on lllumina MiSeq sequencers. The 
pipeline presented includes specific steps fine-tuned for ITS, that are mostly 
missing from pipelines developed for prokaryotes. It (1) employs state of the 
art programs and follows best practices in fungal high-throughput metabarcod- 
ing; (2) consists of modules and scripts easily modifiable by the user to ensure 
maximum flexibility with regard to specific needs of a project or future meth- 
odological developments; and (3) is straightforward to use, also in classroom 
settings. We provide detailed descriptions and revision techniques for each step, 
thus giving the user maximum control over data treatment and avoiding a 
black-box approach. Employing this pipeline will improve and speed up the 
tedious and error-prone process of cleaning fungal lllumina metabarcoding 
data. 
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Introduction 

Metabarcoding rapidly gains importance in ecology, espe- 
cially in the ecology of microorganisms that are often 
identifiable only by molecular tools. High-throughput 
metabarcoding provides unprecedented insights into the 
composition of these cryptic communities (Bik et al. 
2012). The opportunities provided by next-generation 
sequencing were rapidly embraced in prokaryotic and 
fungal ecology. Prokaryotes in general are more intensely 
studied than fungi, but arguably fungi play major roles as 
symbionts, pathogens, or decomposers in natural and 



managed ecosystems. For example, fungi play a more 
dominant role in forest litter cellulose decomposition 
than bacteria (Stursova et al. 2012). Beyond playing a key 
role in leaf litter decomposition, fungi may have a knock- 
on effect on other microbes and subsequent carbon 
cycling in freshwaters (Frossard et al. 2012). Along with 
prokaryotes, living fungi are discovered in the most hos- 
tile environments, for example, in millions of years old 
deep-sea sediments (Orsi et al. 2013). Fungi are also 
highly diverse, with over 1.5 million estimated species 
worldwide, most of which are currently not described 
(Hawksworth 2001). It is expected that the field of fungal 



2642 



© 2014 The Authors. Ecology and Evolution published by John Wiley & Sons Ltd. 
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, 
distribution and reproduction in any medium, provided the original work is properly cited. 



M. Balint et al. 



Pipeline for lllumina IVIetabarcoding 



ecology will strongly benefit from fiiture advances in molec- 
ular methods, such as the high-throughput sequencing 
technologies, which are fiindamental for investigating fun- 
gal communities. 

Data processing currently is a bottleneck in metabar- 
coding projects. The number of reads per study has been 
continuously increasing since the introduction of next- 
generation sequencing (NGS) methods, and it is expected 
to rise as sequencing technologies advance. Data process- 
ing must consider the peculiarities of the taxonomic mar- 
ker, the sequencing instrument and chemistry, as well as 
the experimental needs, such as the requirements for sam- 
ple multiplexing. Well-established pipelines are available 
to process metabarcoding data (RDP - Cole et al. 2009; 
MOTHUR - Schloss et al. 2009; QIIME - Caporaso et al. 
2010; PANGEA - Giongo et al. 2010; WATERS - Hart- 
man et al. 2010; GANGS - Pandey et al. 2010). However, 
these tools were mostly developed with the demands of 
prokaryotic metabarcoding in mind, and it is not 
straightforward to use them for the specific requirements 
of fungal metabarcoding, although limited accommoda- 
tions for fungi already exist (e.g., QIIME). This is partly 
due to the peculiarities of the designated fungal barcode 
(Nilsson et al. 2010; Schoch et al. 2012): the internal 
transcribed spacer (ITS) contains hypervariable and highly 
conserved regions and is phylogeneticaUy noninformative 
in distantly related taxa. During data analysis, it is recom- 
mended to separate variable ITS regions from the sur- 
rounding conserved regions, as conserved regions may 
distort BLAST assignments (Nilsson et al. 2010). There 
are pipelines that are specifically suitable for fungal ITS 
metabarcoding, for example GLOTU (Kumar et al. 2011), 
SGATA (http://scata.mykopat.slu.se/), PLUTOF (Abaren- 
kov et al. 2010b). However, all of these were developed 
for 454-sequenced amplicons, and not for other plat- 
forms. Further, these pipelines are provided as web ser- 
vices, limiting the users' possibilities to modify them 
according to specific experimental needs. Pipelines benefit 
from being rapidly adjustable to keep track with fast 
developments in data handling techniques (e.g., designa- 
tion of operational taxonomic units, OTUs - Edgar 
2013). Adapting the published complex and multifunc- 
tional pipelines for fungal metabarcoding to unconven- 
tional sequencing platforms requires not only a deep 
understanding of their functioning but also substantial 
programming skills. Data pruning approaches that are 
not tailored to the users' barcoding marker, sequencing 
instrument and experimental approach can supply only a 
suboptimal raw sequence cleanup. 

Most metabarcoding studies on fungi have employed 
454 pyrosequencing to date. However, recent develop- 
ments suggest that alternative high-throughput platforms 
might be more suitable for answering questions in fungal 



community ecology. This is mainly due to the high read 
numbers that allow thorough replication (Schmidt et al. 
2013). With 454, it is problematic to analyze complex 
biological samples at sufficient sequencing depth, and at 
the same time ensure massive sample replication required 
in microbial ecology (Prosser 2010). In our opinion, the 
following criteria should be considered when selecting a 
sequencing platform: (1) resulting reads need to be long 
enough to contain sufficient variation; (2) both sequence 
ends should be labeled with the same label, and both of 
these labels need to be sequenced to avoid tag switching 
(Carlsen et al. 2012; Lindahl et al. 2013); and (3) the 
sequencer should allow extensive sample replication, so 
many samples can be multiplexed in the same run. Fur- 
thermore, read numbers should be sufficiently high in 
each sample, and distributed relatively evenly among sam- 
ples. We found that the lllumina MiSeq platform satisfies 
these needs at a low cost, and it is thus a viable alterna- 
tive to 454. The lonTorrent platforms may also be suit- 
able candidates for replacing the 454 in the future, but 
Brown et al. 2013 report high error rates in the primer 
region of the reads, where the multiplexing nucleotide 
labels are located. lllumina technology does not yet allow 
for obtaining the entire ITS region (ITSl, 5.8S, and ITS2) 
with overlapping paired-end reads. Currently, the longest 
reads are 2 x 300 bp. This allows for the simultaneous 
sequencing of ITS2 and ITSl of most species, but without 
the possibility to overlap these reads in the 5.8S region. 

After generating one of the first fungal metabarcoding 
data sets on an lllumina MiSeq (Schmidt et al. 2013; 
similar studies Bokulich et al. 2013; McGuire et al. 
2013), we found that existing pipelines were not readily 
usable with our data. It takes a long time to establish a 
new pipeline from scratch for a new sequencing plat- 
form; it is not trivial to determine the optimal number 
and order of steps; it is important to understand each 
step/program and evaluate their potential shortcomings, 
as all scripts and programs might contain programming 
errors. Finally, the methods applied in pruning metabar- 
coding data develop rapidly, and there new techniques 
need to be incorporated into the data treatment process 
on-the-go. Here, we provide a pipeline for cleaning up 
fungal ITS metabarcoding data generated on lllumina 
MiSeq sequencers with a paired-end option. Instead of 
creating a complex, multifunctional pipeline we focused 
on the specific needs of fungal ecologists who intend to 
replace 454 with lllumina for their metabarcoding-based 
research, and who want to keep their data cleaning pro- 
cedures up-to-date. The pipeline is assembled from sim- 
ple, independent steps to facilitate further adaptations to 
the specific needs of the users and the implementation 
of future methodologies. We emphasize that users should 
be able to easily understand, modify, and error-check 
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every data pruning step. The following considerations 
guided our pipeline assembly: 

1 Best practices. We employed state of the art programs 
and adhered to existing recommendations for pruning 
fungal high-throughput data (Nilsson et al. 2011). 

2 Flexibility. Users can modify and replace pruning steps. 
This is of particular importance considering the rapid 
changes in sequencing technology (e.g., increasing read 
lengths), the continuous development of new tools for 
particular problems, and peculiarities of specific molec- 
ular markers or data sets. 

3 User friendliness. The steps in the pipeline were devel- 
oped on an open-source Ubuntu Linux system, and its 
operation requires only basic knowledge of UNIX. 

We found it important to provide detailed descriptions 
and explanations, as well as revision techniques for each 
step to help users understand individual procedures, and 
avoid mistakes. We think that security checks are essential 
when setting up a data cleaning pipeline. Users might be 
more inclined to leave out the control steps with a 
wrapped pipeline. This pipeline will make sequence 
cleanup more efficient for researchers with basic bioinfor- 
matics background and accelerate research in the field of 
fungal community ecology. 

Materials and Methods 

In the following, we use the example of a fungal ITS 
rDNA data set to describe molecular methods and pipe- 
line development. 

Molecular methods 

We collected 96 soil samples from a low-input meadow 
located at Florsheim, Germany (N50° 0' 26.482", E8° 23' 
58.502"). We extracted total DNA from 300 mg of soil 
using the FastDNA SPIN Kit for Soil (MP Biomedicals, 
Santa Ana, CA). We PGR amplified the ITS2 rDNA 
region with the primers ITS3_KY02 and ITS4_KY03 
(Toju et al. 2012). We used two annealing temperatures 
(51°G and 55°G) and three replicated PGRs for each 
annealing temperature to account for the stochasticity of 
PGR reactions (Schmidt et al. 2013). Amplifications were 
carried out in a total volume of 20 j-tL using 10 ng of 
DNA, 4 i^iL of HOT MOLPol Blend Master Mix (Mole- 
gene, Germany), and 0.8 /.(L (10 /mol/L) of each primer. 
PGR conditions were 15 min at 95°G, followed by 30 
cycles of 30 sec at 95°G, 30 sec at either 51°G or 55°G, 
and 30 sec at 72°G, and final elongation for 5 min at 
72°G. PGR products were purified with Agencourt AM- 
Pure XP SPRI magnetic beads (Beckman Goulter, Brea, 
GA). We labeled the primers with 8 bp long tags 
(Kozarewa and Turner 2011) to identify samples after 



multiplexed sequencing. We used the same labels for for- 
ward and reverse primers. Labeling-PGRs were carried 
out in a total volume of 30 i^iL using 20 ng of purified 
PGR product, 6 /(L of HOT MOLPol Blend Master Mix, 
and 1 /(L (10 /imol/L) of each labeled primer. PGR con- 
ditions for this reaction were 15 min at 95°G, followed 
by six cycles of 30 sec at 95°G, 30 sec at 52°G and 
30 sec at 72°G, and final elongation for 5 min at 72°G. 
Amplicons were visualized with gel electrophoresis. After 
purification with Agencourt AMPure XP SPRI magnetic 
beads we normalized and pooled the PGR products. We 
sequenced three amplicon pools on the lUumina MiSeq 
platform using the paired end (2 x 250 bp) option at 
the University of Minnesota Genomics Genter. The frag- 
ment length distribution of the amplicons was analyzed 
on Agilent Bioanalyzer assays at the University of Minne- 
sota Genomics Genter before sequencing. 

Pipeline dependencies 

This pipeline was elaborated and run on an Ubuntu 12.04 
system. The following programs, scripts and data bases 
were used: 

Programs 

1 PANDAseq (MaseUa et al. 2012, https://github.com/ 
neufeld/pandaseq/wiki/PANDAseq- Assembler). 

2 fqgrep (https://github.com/indraniel/fqgrep). 

3 Fastx Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). 

4 FungallTSextractor (Nilsson et al. 2010, http://emeren- 
cia.org/FungalITSextractor.html). 

5 MEGAN v4 (Huson et al. 2011, http://ab.inf uni-tueb- 
ingen.de/software/megan/) . 

6 USEARGH v7.0.1001 (Edgar 2010, http://drive5.com/ 
uparse/). 

7 BLAST V2.2.27-I- (Altschul et al. 1997). 
Scripts 

1 Reads_Quality_Length_distribution.pl (Supplementary 
Material). 

2 remove_multiprimer.py (Supplementary Material). 

3 demultiplex.sh (Supplementary Material). 

4 rename.pl (Supplementary Material). 

5 fasta_number.py (USEARGH v7, http://drive5.com/ 
python/). 

6 uc2otutab.py (USEARGH v7, http://drive5.com/python/). 
Data bases 

1 GenBank (http://www.ncbi.nlm.nih.gov/genbank/). 

2 UNITE (Abarenkov et al. 2010a). 
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Data bases 

(Supplementary Material, httpi/Zdx.doi. org/10. 12761/ 
SGN.2014.2) 

1 subsampled lUumina MiSeq fastq files (exp02pool02_ 
S1_L00 1_R1_00 1 .fastq, exp02pool02_S 1_L00 1_R2_00 1 .- 
fastq, each 50,000 paired-end, 250 bp-long reads). 

2 primers (primers.txt). 

3 Comma separated value (CSV) files containing sample 
names and labels -I- primers (forward_labels.csv, rever- 
se_labels.csv). 

Notes 

The scripts and commands used here use the following 
file extensions: .fasta for fasta, .fastq for fastq, .csv for 
CSV. The scripts can be edited to comply with different 
extension naming. 

All commands and programs should be added to the 
system path, or they should be present in the same folder 
from where they are called. 

Pipeline for processing lllumina 
metabarcoding reads 

Processing fungal metabarcoding data is prone to compli- 
cations and errors. It is important to consider the charac- 
teristics of the barcoding fragment, the sequencing 
instrument (e.g., error type and frequency), and the par- 
ticular problems that come from PCRs and sample multi- 
plexing. Our pipeline consists of procedures aimed at (1) 
assuring sequence quality; (2) assembling paired-end 
reads; (3) reliable demultiplexing; (4) the separation of 
informative barcode fragments from low-variability frag- 
ments ("ITS extraction"); and (5) the identification of 
fungal OTUs. Sequence quality is assured via the removal 
of low-quality reads, plausible chimeras, and clustering at 
thresholds higher than the expected sequencing error 
frequency. Paired-end assembly of the forward- and 
reverse-sequenced reads is important for the recovery of 
the complete ITS2 fragment, and it also allows reliable 
demultiplexing with multiplexing labels located on both 
ends of the fragments. It is important (and specific to 
fungal barcoding) that the barcode is a noncoding DNA 
fragment, with highly variable (i.e., taxonomically infor- 
mative) and rather conserved parts. It is recommended 
for both clustering, and BLAST searches that only the 
highly variable regions are used (NUsson et al. 2010). 
Reads are generally grouped into OTUs using sequence 
similarity (clustering) before ecological analyses. It is 
important to retain only reads that likely originate from 
the target organisms (but not from, e.g., plants) before 
proceeding with ecological analyses. 



Pipeline for lllumina Metabarcoding 

We defined the order of steps with regard to practical 
considerations. For example, it is practical to do the 
paired-end assembly immediately before removing primer 
artifacts (i.e., as step 2 instead of step 3 in the pipeline), 
otherwise an extra step is needed to ensure that the order 
of reads in the two fastq files is preserved. In general, we 
recommend the following: 

1. Quality filtering 

Raw read pairs are filtered for an average read quality 
threshold with a script provided here. It is important to 
preserve the order of the reads in both forward and 
reverse read files: the paired-end read assembler needs 
corresponding read orders in both files, 
perl Reads_Quality_Length_dist r ibution .pi -f w 
f orward_reads . fastq -rw reverse_reads . fastq -sc 
33 -q26 -1 150 -Id N 

Explanation: -sc lllumina phred score format, -q mean 
quality threshold, -1 reads shorter than this number wUl 
be discarded, -Id give read length distribution. 

Output: forward and reverse fastq files with quality-fil- 
tered sequences. If a sequence is removed from either of 
the files due to low quality, its pair is also removed from 
the other file. 

Recommended checks: fastq files show the desired 
improvement (use, e.g., FastQC for easily checking this, 
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/); 
read loss is low. 

2. Paired-end assembly 

We assembled paired-end reads with PANDAseq (MaseUa 
et al. 2012). The program corrects mismatching bases in 
the overlapping region according to the basecaU with the 
higher quality. 

pandaseq-f f orward_reads . f astq-r 
reverse_reads .fastq-FN-o5> 
pair ed_assemb led . fastq 

Explanation: -¥ preserve fastq format, -N remove reads 
with unknown nucleotides, -o minimum read overlap 
between forward and reverse. 

Output: fastq file containing the paired-assembled 
sequences. The program runs only if the sequence order 
in the forward and reverse fastq inputs is the same. 

Recommended checks: sequence loss is low; both forward 
and reverse primers are present in the expected places; a 
random blast of some reads gives reasonable hits; the 
fragment length distribution (use, e.g., FastQC for this) 
resembles what was expected from the Agilent High Sen- 
sitivity DNA assay chip results from the initial sequencing 
library QC (if available). 
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3. Remove primer artifacts 

We observed that sequences may contain multiple primer 
occurrences. Primers were sometimes found in the middle 
of the sequence, sometimes multiple times at the ends of 
the sequences. Sequences with such primer artifacts had 
lengths similar to the expected length of the ITS2 frag- 
ments. They were not observed with gel electrophoresis, 
and with Agilent Bioanalyzer assays. These sequences 
should be removed with a script supplied here. The script 
cannot handle nucleotide ambiguities in the primer 
sequence. 

python remove_multiprimer .py -i input . f astq -o 
output . fastq -f <f orwardPrimerSequence> -r 
<rever sePr imerSequence> 

Output: fastq file with sequences that do not contain 
primer multimers. 

Recommended checks: search for the location of primers 
in the sequence data, for example with the command grep. 

4. Reorient reads to 5'-3' 

Paired-end reads from the lUumina platform are ran- 
domly attached to the sequencing lane, so that the output 
files contain approximately 50% reads in each direction. 
These reads must be reoriented in 5'-3' direction for all 
downstream steps. This can be done using grep-type 
commands to separate reads containing the forward and 
reverse primers. Some grep-type commands allow 
mismatches in the search strings. We use fqgrep, which 
is specifically developed for manipulating fasta and 
fastq textfiles. Then 3'-5' reads can be easily reverse 
complemented (we use a Fastx Toolkit command, 
fastx_reverse_complement for this), 
fqgrep -mN -p 'f orward_primer_sequence' -e 
pair ed_assembled_good . f astq > good_5-3 .fastq 
fqgrep -mN -p 'reverse_primer_sequence' -e 
paired_assembled_good. f astq> good_3-5 . fastq 

Explanation: -m allow N mismatches, -p Pattern of 
interest to grep, -e allow logical expressions (e.g., for use 
with ambiguous sites [AGCT]). 

Output: two fastq files, one containing reads sequenced 
in 5'-3' direction, the other containing reads sequenced in 
3'-5' direction during the first sequencing run. 
f astx_reverse_complement -Q33 -i good_3-5 . fastq 
» good_5-3 . fastq 

Explanation: -Q33 format of quality scores in the fastq 
(if needed). 

Output: fastq file with all reads reoriented into 5'-3' 
direction. 

Recommended checks: randomly check the beginnings 
and ends of sequences for presence of correct primers. 
This can be done using the standard commands less, grep, 
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head, tail, or in a sequence alignment viewer (SeaView is 
a simple and fast option on Linux, Gouy et al. 2010). The 
fastq file needs to be converted to fasta with for example, 
the fastq_to_fasta of the FASTX Toolkit (http://hannon- 
lab.cshl.edu/fastx_toolkit/index.html). 

5. Demultiplexing 

We retain only those reads that contain a perfectly 
matching primer + label combination on both ends. We 
use a script that relies on fqgrep (https://github.com/in- 
draniel/fqgrep). The script can be modified according to 
the fqgrep manual to allow more complicated search pat- 
terns and fastq output. Currently the output files are in 
fasta format. 

bash demultiplex . sh f orward_labels . csv 
r ever se_lab els . csv 5-3_or lent ed .fastq 

Explanation: forward_labels.csv: CSV (comma sepa- 
rated value) file containing sample names and the 5'-3' 
orientation label -i- primer sequences (see sample files); 
reverse_labels.csv: CSV file containing sample names and 
the 3'-5' orientation label + primer sequences (see sam- 
ple files); 5-3_oriented.fastq: sequences reoriented to 5'- 
3' direction. 

Output: separate fasta files, each corresponding to a 
sample. 

Recommended checks: check random demultiplexed 
samples for the presence of labels + primers at the 
expected ends; check whether names correspond to pri- 
mer combinations; all expected samples are retained; read 
number differences among samples are not substantial. 

6. Pool files and remove primers and labels 

The name of the samples is inserted into the headers of 
the fasta reads. Once the sequence headers contain the 
sample names, the samples can be pooled, and the prim- 
ers and labels can be removed. We use a command from 
the Fastx Toolkit to trim labels and primers, 
perl rename .pi 

Explanation: the script introduces the file names into 
the headers of the fasta reads. The script has to be run in 
the folder containing the sample fasta files. 

Output: separate fasta files, each corresponding to a 
sample. The file names (sample names) are included in 
the header of each sequence, 
cat *.fasta» comb ined_samples . fasta 

Explanation: combines samples. 

Output: fasta file with pooled sequences from each 
sample. The sample identity is preserved in the sequence 
headers. 

f astx_trimmer -f 27 -i combined_samples . fasta -o 
head_tr immed .fasta 
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f astx_trimmer -t 25 -i he ad_t rimmed. f asta -o 
tr immed . fast a 

Explanation: -f first base to keep, -t trim N reads firom 
the end of the sequences (these two steps cannot be com- 
bined in the fastx_trimmer; our primer + label combina- 
tions are 26 bp long). 

Output: fasta file with forward and reverse primers, and 
multiplexing labels removed. 

Recommended checks: Check the beginning and end of 
the reads to ensure that primers were removed com- 
pletely, for example, in SeaView; check if combined read 
numbers make sense. 

7. Extract fungal ITS 

Internal transcribed spacer amplicons contain conserved 
regions of the SSU (small rRNA subunit gene), 5.8S and 
LSU (large rRNA subunit gene), where the primers are 
located. If not removed, these conserved regions may 
bias clustering and BLAST searches, because they 
increase similarity among sequences (NUsson et al. 
2010). The FungallTSextractor (Nilsson et al. 2010) 
extracts ITSl and/or ITS2 from the reads, or discards 
reads if they do not match the structural ITS model. 
The script shortens long fasta headers, so it is important 
to remove unnecessary information from the headers 
before extracting the ITS. The redundant information 
generally refers to the properties of a sequencing run. 
The truly sequence-specific parts of a header are the 
physical coordinates of the DNA fragment cluster on the 
sequencing lane. The sed UNIX command removes 
redundant information from the sequence headers. The 
time required for ITS extraction can be reduced if the 
input file is split into several parts which are run sepa- 
rately, and the results are combined again, 
sed 's/<redundant inf ormat ion>//g' tr immed . fasta > 
tr immed_named . fasta 
perl FungallTSExtractor .pi 

Explanation: The file with the unique sequence headers 
has to be copied into the indata folder of the FungallT- 
SExtractor tool. The results are in the outdata folder. 

Output: fasta files with extracted ITS sequences in the 
outdata folder of the FungallTSExtractor. Sequences not 
corresponding to the ITS sequence model are stored in 
separate files. 

Recommended checks: aligning a number of reads in 
some samples does not show conserved alignments at the 
beginning and end of the reads; randomly picked 
extracted ITS sequences give positive (5'-3') hits in the 
UNITE/GenBank. Check the headers of the resulting fasta 
file to ensure that it preserves enough information on 
sequence identity. 



8. Similarity clustering 

The clustering is based on the uparse pipeline (Edgar 
2013) of the USEARCH v7 (Edgar 2010). The steps 
include (1) grouping of replicate sequences; (2) sorting 
sequences according to decreasing abundance; and (3) 
OTU identification and de-novo chimera filtering, 
usear ch -der ep_full length ITS2. fasta -output 
derep . fasta -size out 

Explanation: -sizeout adds the number of replicate 
sequences into the header of their representative 
sequence. 

Output: fasta file with unique ITS sequences, 
usear ch -sortbysize derep. fasta -output 
sorted. fasta -minsize 2 

Explanation: -minsize removes representative sequences 
that are present less than n times. We generally remove 
singletons during this step. 

Output: fasta file without singletons (or doubletons, tri- 
pletons, etc., by choice). 

usearch -cluster_otus input_f ile . fasta -otus 
output_f ile_otus .fasta -otuid 0.97 

Explanation: -otuid specifies the clustering threshold. 

Output: fasta file with the centroid sequences of 
the clusters generated at the specified similarity thresh- 
old. 

Recommended checks: FoUow read loss by summarizing 
the size information from the headers of the cluster-rep- 
resentative sequences, for example 

grep ">" otus . fasta | sed "s/size=/ , /g" > header s . csv 
Open headers. CSV in a spreadsheet program and sum 
the read numbers. 

9. Reference-based chimera filtering 

We perform a second chimera filtering step using 
USEARCH v7 (Edgar 2010). The database of the plausible 
parent sequences is generated from the UNITE fungal ITS 
database (Abarenkov et al. 2010a). 

usearch -makeudb_usear ch UNITE_input_database . 
fasta -output UNITE .udb 

Explanation: ITS fasta sequences from UNITE are con- 
verted into a reference database. 

Output: USEARCH-formated database file for refer- 
ence-based chimera filtering. 

usear ch -uchime_ref otus.fasta-db UNITE . udb 
-nonchimeras otus_good . fasta -chimeras 
otus_chim. fasta -strand plus 

Explanation: -strand plus: the filtering is correct only if 
DNA sequences are in 5'-3' direction, and if this is explic- 
itly specified. 

Output: fasta files containing sequences deemed non- 
chimeric or chimeric. 
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Recommended checks: Follow chimeric sequence num- 
bers: generally chimera formation is not extensive, 
although some chimeras are found. A BLAST search of 
randomly picked chimeric sequences suggests that these 
are indeed chimeras. False positives are possible. Nilsson 
et al. (2012) present useful ideas to identify chimeras with 
BLAST searches. 

10. Identify fungal OTUs 

Primers used for fungi may (and do) amplify sequences 
from nontarget organisms; for example, plants. OTUs that 
do not belong to the target taxa must be discarded. We 
BLAST OTU representative sequences against the GenBank 
nucleotide database (nt). Although not specifically curated 
for fungal ITS sequences (in contrast to the fungal ITS-spe- 
cific UNITE (Abarenkov et al. 2010a), blasting against the 
nt will give information also about the proportion of 
diverse taxonomic groups that may have been co-amplified 
with fungal target sequences. We use this step to select 
OTUs of fungal origin for downstream ecological analyses. 
We do not attempt taxonomic assignments of the OTUs in 
this step. The most current nt database can be obtained 
with an NCBI download script (update_blastdb.pl, part of 
the BLAST-h). We use MEGAN (Huson et al. 2011) to 
parse the BLAST results, and to retain the ITS sequences of 
the fimgal OTUs. 

blastn -db /database/GenBank/nt -query 
input . fasta -outfmt 5 -out output. xml 
-num_threads=N 
-evalue 0 . 001 

Explanation: -outfmt 5 specifies xml as output format, 
-num_threads allows to use multiple processors, -help 
gives a detailed list of options. 

The xml, and the blasted fasta file, is imported into ME- 
GAN. The lowest common ancestor assignments depend 
on several options, our choices for lUumina paired-end 
reads are minimum reads 1, minimum score 170, upper 
percentage 5, no minimum complexity, no min complex- 
ity (0). We uncoUapse all branches, select Fungi, and from 
the Select menu select Subtree. Reads should be exported 
from the File menu (File/Export/Reads). 

Output: xml file containing the BLAST hits of the OTU cen- 
troid sequences, rma file containing the parsed BLAST results, 
fasta file containing the exported fungal OTU sequences. 

Recommended checks: there are no issues with the fasta 
header formatting of the OTUs when inspecting taxon 
assignments; the number of OTUs that MEGAN shows 
as fungi corresponds with the number of OTU 
sequences exported into a fasta file; count number of 
reads in the resulting fungal_otus.fasta file and compare 
with the MEGAN counts; check if sequence headers are 
complete. 
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11. Fungal OTU abundance table 

The generation of the fungal abundance table is based on 
python scripts supplied with USEARCH v7 and includes 
the following steps: (1) label OTU-representative 
sequences according to a pattern (here: OTU_l, OTU_2, 
etc.); (2) map original, not clustered reads (here: ITS2 
sequences generated by FungallTSExtractor) against the 
fungal OTU-representative sequences; (3) specify the sam- 
ple information in the mapping results; and (4) generate 
the OTU table from the mapping results, 
python f asta_number .py f ungal_otus .fasta OTU_ > 
fungal_otus_numbered. fa 
usearch -usearch_global ITS2 . fasta -db 
f ungal_otus_numbered . fa -strand plus -id 0. 97 -uc 
f ungaLreadmap . uc 

sed -i 's /REV/bar codelabel=REV/g' f ungal_r eadmap . uc 

Explanation: The name of the samples has to be speci- 
fied in the readmap file (barcodelabel=). Our sample 
naming scheme allows for a simple sed-based modifica- 
tion of the readmap. Alternatively, the USEARCH v7 
script fastq_strip_barcode_relabel.py may also be used to 
specify the sample names. 

Output: fasta file with numbered OTU sequences. Text 
file (readmap. uc) containing the results of OTU centroid 
sequence mapping against the original ITS file, 
python uc2otutab .py f ungal_readmap .uc > 
f ungaLotu_t able . txt 

Output: Text file (tab-delimited) containing the OTU 
abundance table of samples. 

Recommended checks: Check read sums in the abun- 
dance table according to samples/OTUs. 

Results 

The three sequencing runs resulted in ~40 million paired- 
end reads (Table 1). Reads were relatively evenly distrib- 
uted among the 80 multiplexed samples (Fig. 1). The 
pruning of the raw sequence data resulted in considerable 
read losses: ~30 million reads were lost during paired-end 
assembly, reorientation, demultiplexing, and chimera 
checking, and only -10 million reads were considered 
correct ITS2 sequences after ITS extraction. Most reads 
were lost during the demultiplexing step. PhiX DNA frag- 
ments were added to the sequencing reactions as a stan- 
dard practice during the lllumina sequencing of 
amplicons, but the PhiX sequences were removed during 
the sequence postprocessing already by the sequencing 
facility. Thus, it is not the presence of the PhiX fragments 
that causes the read losses observed during demultiplex- 
ing. Finally, about 6 million of the pruned reads were 
assigned to fungi, while most of the remaining sequences 
could not be taxonomically assigned (Fig. 2). 
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Figure 1. Frequency of soil samples in relation 
to the number of lllumina MiSeq reads 
allocated to each sample. The majority of 
samples contained between 40,000 and 
70,000 lllumina MiSeq reads. 
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Figure 2. Distribution of lllumina IVliSeq reads 
and 97% OTUs across different taxonomic 
groups. Approximately, 5.9 million reads are 
assigned to fungi (from a total of >10 million 
reads). From a total of 14,636 OTUs 3208 
could be assigned to fungi. 



The ~10 million pruned ITS2 reads were clustered into 
16,623 operational taxonomic units (OTUs) at 97% 
sequence similarity by the heuristic clustering algorithm 
of USEARCH v7. Of these, 1842 OTUs were deemed chi- 
meric by the USEARCH-based de-novo chimera filtering, 
and 145 by the reference-based chimera filtering. The two 
chimera-checking steps suggest that -11.95% of aU OTUs 
were chimeric. Of the remaining 14,636 OTUs, only 3208 
were assigned to fungi (Fig. 2). Most OTUs could not be 
assigned to any known taxon. 

Discussion 

The pipeline presented here fills an important gap as the 
first collection of practical steps for cleaning fungal ITS 
metabarcoding data generated as paired-end reads on lllu- 
mina MiSeq sequencers. In our opinion the lllumina MiSeq 
platform is a likely successor of 454 pyrosequencing in the 
metabarcoding of fungi, at least until the quality issues of 
the lonTorrent platforms are solved (Brown et al. 2013). 



Given the importance of fungi in almost every natural sys- 
tem, we expect rapid advances on this field. 

We benchmarked our pipeline with a large ITS2 meta- 
barcoding dataset generated as paired-end, 2 x 250 bp 
long reads during three MiSeq runs. We experienced con- 
siderable read losses during data cleanup: from the origi- 
nal ~40 million reads only ~6 million was suitable for 
downstream ecological analyses as high-quality fungal 
reads. About 2.8 million sequenced reads contained mul- 
tiple primer occurrences, an error type that we were una- 
ware of from the literature. Most sequences were lost 
during the reorientation and demultiplexing steps: ~20 
million reads did not contain the correct primer and/or 
label sequences on both ends. We note that we were con- 
servative by not allowing any mismatches in the primer 
or label sequences, and this likely contributed to the sub- 
stantial read losses during this step. We were also conser- 
vative with chimeras by applying two (a de-novo and a 
reference-based) chimera filtering steps. The separation of 
the variable ITS region from the conversed surrounding 
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fragments (an important step in processing hypervariable 
ITS data) was the most time-consuming step that 
demanded hundreds of processor hours. The time 
required by this step can be substantially reduced by per- 
forming the ITS extraction after the de-replication of 
sequences (Table 1, step 8A). 

Our pipeline considers all data cleanup steps recom- 
mended for fungal metabarcoding (Nilsson et al. 2011). It 
also considers the specifics of the fungal metabarcode and 
the characteristics of the lUumina sequencing platform. 
Finally, the pipeline addresses not only the known issues 
of metabarcoding (e.g., low-quality reads, chimeras, possi- 
ble multiplexing label switching), but also a new problem 
we encountered: the multiple occurrences of the barcod- 
ing primers in some sequences. The pipeline is highly 
modular and easily modifiable with basic UNIX knowl- 
edge. We also emphasize its transparency, as users should 
understand the way each cleaning step is carried out. We 
included "revision techniques" for each step, as we found 
it important to continuously check the correctness of data 
generated in each step, and to correct the performance of 
each program. This is especially important during the 
"testing" phase of the pipeline, i.e., when modular steps 
need replacement, or when the experimental design 
requires alteration. Many of these checks may seem obvi- 
ous for researchers experienced in data cleanup, but our 
own experience shows that they considerably help inexpe- 
rienced users to understand, perform and verify each step. 
We refrained from providing a wrapper script for the 
entire pipeline, but all steps presented here can be easily 
wrapped up in simple bash scripts, or in workflow 
engines such as Snakemake (Koster and Rahmann 2012). 
Wrapping steps is up to the users, and in our opinion it 
should be done only after being confident that every data 
processing step provides reliable results. 

We see this pipeline as a user-friendly and loose col- 
lection of recommendations and practical instructions for 
analyzing fungal lUumina-based ITS data, rather than a 
final and standalone product. We emphasize the adapt- 
ability of the pipeline to ever-changing user needs and 
technological advances. We expect that our pipeline will 
be modified in the future to deal with new technological/ 
methodological challenges. Every modular step can be 
replaced with alternatives; for example, Trimmomatic for 
quality filtering of paired-end data (Lohse et al. 2012), 
the ITSx (Bengtsson-Palme et al. 2013) for ITS extrac- 
tion, or GramCluster (Russell et al. 2010) for a nondis- 
tance-based clustering. Confirmation for the user 
friendliness of this pipeline came from participants of a 
Masters' course in ecology and evolution at the Goethe 
University of Frankfurt. Students who had never pro- 
cessed this type of data before and were largely inexperi- 
enced in bioinformatics had no difficulties in learning 
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and independently applying the pipeline after a few 
hours of hands-on training. 

According to our experience metabarcoding pipelines 
should be collections of easy-to-adapt simple scripts and 
practical computer commands, rather than complex tools. 
Users should have maximum understanding and control 
over the pruning of their data to avoid mistakes. 
Although complex standalone pipelines have the advanta- 
ges of providing relatively generalized solution to a range 
of problems, they lag behind in the specificity demanded 
by individual research groups and particular experiments. 
We hope that our pipeline will allow researchers to ana- 
lyze their paired-end lllumina fungal ITS metabarcoding 
data more readily. 
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